Neural network engine

ABSTRACT

A neural network engine comprises a plurality of floating point multipliers, each having an input connected to an input map value and an input connected to a corresponding kernel value. Pairs of multipliers provide outputs to a tree of nodes, each node of the tree being configured to provide a floating point output corresponding to either: a larger of the inputs of the node; or a sum of the inputs, one output node of the tree providing a first input of an output module, and one of the multipliers providing an output to a second input of the output module. The engine is configured to process either a convolution layer of a neural network, an average pooling layer or a max pooling layer according to the kernel values and whether the nodes and output module are configured to output a larger or a sum of their inputs.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. application Ser. No. 15/955,426 filed Apr. 17, 2018, which application relates to International PCT Application No. PCT/EP2016/081776 (Ref: FN-481-PCT), the disclosures of which are incorporated herein by reference in their entireties.

FIELD

The present invention relates to a neural network engine.

BACKGROUND

A processing flow for a typical Convolutional Neural Network (CNN) is presented in FIG. 1. Typically, the input to the CNN is at least one 2D image/map 10 corresponding to a region of interest (ROI) from an image. The image/map(s) can comprise image intensity values only, for example, the Y plane from a YCC image; or the image/map(s) can comprise any combination of colour planes from an image; or alternatively or in addition, the image/map(s) can contain values derived from the image such as a Histogram of Gradients (HOG) map as described in PCT Application No. PCT/EP2015/073058 (Ref: FN-398), the disclosure of which is incorporated by reference, or an Integral Image map.

CNN processing comprises two stages:

-   -   Feature Extraction (12), the convolutional part; and
    -   Feature classification (14).

CNN feature extraction 12 typically comprises a number of processing layers 1 . . . N, where:

-   -   Each layer comprises a convolution followed by optional subsampling (pooling);
    -   Each layer produces one or (typically) more maps (sometimes referred to as channels);
    -   The size of the maps after each convolution layer is typically reduced by subsampling (examples of which are average pooling or max-pooling);
    -   A first convolution layer typically performs 2D convolution of an original 2D image/map to produce its output maps, while subsequent convolution layers perform 3D convolution using the output maps produced by the previous layer as inputs. Nonetheless, if the input comprises say a number of maps previously derived from an image; or multiple color planes, for example, RGB or YCC for an image; or multiple versions of an image, then the first convolution layer can operate in exactly the same way as successive layers, performing a 3D convolution on the input images/maps.

FIG. 2 shows an example 3D convolution with a 3×3×3 kernel performed by a subsequent feature extraction convolution layer of FIG. 1. The 3×3×3 means that three input maps A, B, C are used and so, a 3×3 block of pixels from each input map is needed in order to calculate one pixel within an output map.

A convolution kernel also has 3×3×3=27 values or weights pre-calculated during a training phase of the CNN. The cube 16 of input map pixel values is combined with the convolution kernel values 18 using a dot product function 20. After the dot product is calculated, an activation function 22 is applied to provide the output pixel value. The activation function 22 can comprise a simple division, as normally done for convolution, or a more complex function such as a sigmoid function or a rectified linear unit (ReLU) activation function of the form: y_(j)=h(x_(j))=max(0,x_(j)) as typically used in neural networks.
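The dot product and activation steps described above can be illustrated with a minimal Python sketch. The function names `relu` and `conv_output_pixel` and the nested-list layout are assumptions made for illustration only; they do not describe the hardware of the embodiments.

```python
def relu(x):
    """ReLU activation of the form max(0, x)."""
    return max(0.0, x)


def conv_output_pixel(block, kernel, activation=relu):
    """Compute one output pixel from a cube of input values.

    block and kernel are nested lists indexed [map][row][col] and must
    have the same shape (3x3x3 in the example of FIG. 2).
    """
    acc = 0.0
    for map_vals, map_weights in zip(block, kernel):
        for row_vals, row_weights in zip(map_vals, map_weights):
            for value, weight in zip(row_vals, row_weights):
                acc += value * weight  # one term of the dot product
    return activation(acc)


# Illustrative data only: three 3x3 blocks (maps A, B, C) and 27 weights.
block = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
kernel = [[[0.5] * 3 for _ in range(3)] for _ in range(3)]
print(conv_output_pixel(block, kernel))  # 27 terms of 0.5 -> 13.5
```

With all weights negated, the dot product becomes negative and the ReLU activation returns 0.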

In this case, for 2D convolution, where a single input image/map is being used, the input image/map would be scanned with a 3×3 kernel to produce the pixels of a corresponding output map.

Within a CNN Engine such as disclosed in PCT Application No. PCT/EP2016/081776 (Ref: FN-481-PCT) a processor needs to efficiently implement the logic required to perform the processing of different layers such as convolution layers and pooling layers.

SUMMARY

According to the present invention there is provided a neural network engine according to claim 1.

Embodiments of the present invention incorporate a module for outputting floating point results for pooling or convolution layer operations in a neural network processor.

By sharing the resources needed for both layer types, the processing pipeline within the neural network engine can be kept simple with the number of logic gates as low as possible.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the invention will now be described, by way of example, with reference to the accompanying drawings, in which:

FIG. 1 shows a typical Convolutional Neural Network (CNN);

FIG. 2 shows an exemplary 3D Convolution with a 3×3×3 kernel;

FIG. 3 shows a typical floating point adder;

FIG. 4 shows a tree for implementing a 3×3 convolution kernel;

FIGS. 5(a) and (b) show trees for 3×3 average and maximum pooling respectively;

FIG. 6 shows processing logic configured for implementing a convolutional network layer according to an embodiment of the present invention;

FIG. 7 shows the processing logic of FIG. 6 configured for implementing an average pooling layer;

FIG. 8 shows the processing logic of FIG. 6 configured for implementing a max pooling layer;

FIG. 9 illustrates a conventional ReLU activation function;

FIG. 10 illustrates a PReLU activation function implemented by an embodiment of the present invention;

FIG. 11 shows the output module of FIGS. 6-8 with an interface for implementing the PReLU activation function of FIG. 10 in more detail;

FIG. 12 illustrates a conventional PEAK operation; and

FIG. 13 shows a variant of the processing logic of FIGS. 6-8 for additionally implementing either a PEAK operation or two parallel 2×2 Max pooling operations.

DETAILED DESCRIPTION OF THE EMBODIMENTS

A floating point (FP) representation is usually employed for both the convolution kernel weights and image/map values processed by a CNN engine. A typical format suited for a hardware CNN engine implementation is 16 bit, half precision floating point (IEEE 754-2008). However, 8-bit FP representations can also be used and PCT Application No. PCT/EP2016/081776 (Ref: FN-481-PCT) discloses how a default FP exponent bias can be changed to a custom value to avoid needing to use higher precision FP formats.

FIG. 3 shows an exemplary conventional 16 bit FP (IEEE 754-2008) floating point adder; clearly this structure is similar for any precision. As is known, such an adder must know which of two operands comprising s1/exp1/m1 and s2/exp2/m2 respectively is larger in order to align the mantissa of the operands. Thus, the adder shown includes a comparator 40 producing a binary output (O/P) indicating which of the floating point operands is larger.

The sign bits s1, s2 of these operands are in turn used through multiplexer 42 to select between a version of m1 shifted according to the difference between exp2 and exp1 subtracted from m2; or the shifted version of m1 added to m2; and through multiplexer 44 to select between a version of m2 shifted according to the difference between exp2 and exp1 subtracted from m1; or the shifted version of m2 added to m1. Now using multiplexers 46, 48 and 50 depending on the output O/P of the comparator 40, the output of either multiplexer 42 or 44 is chosen to generate the mantissa for the output of the adder. The value of the mantissa is also used to adjust whichever of exp1 or exp2 is selected by multiplexer 48 to generate the exponent for the output of the adder, while the sign for the output of the adder is chosen as either s1 or s2.

Note that this adder is shown for exemplary purposes only and the present invention is equally applicable to any equivalent floating point adder.

Referring now to FIG. 4, it will be seen that convolution layers need: logic implementing floating point multipliers (30) and adder logic (32) to implement the dot product function (20) shown in FIG. 2 as well as logic for the activation function (22). Note that any 3D convolution layers can be broken into a longer chain of multipliers whose products are then added in a deeper tree than shown. Thus, the 3×3×3 convolution shown in FIG. 2 might require 27 multipliers. The logic shown in FIG. 4 can therefore be expanded to deal with the largest form of convolution to be performed by a given CNN engine and where smaller or simpler 2D convolutions are to be performed, then for example, selected non-active weights within the kernel can be zeroed.

Separately, FIG. 5(a) shows that average pooling layers need: adders (34) as well as logic (36) for implementing a multiplication by 1/kernel size; while FIG. 5(b) shows that maximum pooling layers need comparators (38).

Looking at the trees of FIGS. 4 and 5, it will be seen that the adder (32, 34) and comparator (38) trees have similar structures with each node in a tree either producing a sum of its inputs or a maximum of its inputs respectively.

It will also be seen that any floating point adder such as the adder of FIG. 3 already comprises comparator circuitry 40 for determining which of its operands is larger. Thus, with little additional configuration circuitry, any such adder can be converted to either select the sum of its input operands (as normal) or the larger of the operands fed to the comparator based on a single control input. The logic paths producing these results are separate and so this functionality can be implemented without affecting the timing of the adder.
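The behaviour of such a configurable node can be sketched in software as follows. This is a behavioural illustration only, not a description of the circuit of FIG. 3: the constants `ADD` and `MAX` and the function name `configurable_node` are hypothetical, and Python floats stand in for the FP16 operands.

```python
ADD, MAX = 0, 1  # hypothetical encodings of the single control input


def configurable_node(a, b, op_sel):
    """Return either the sum or the larger of two operands.

    The comparison result is computed once and is available to both
    paths, mirroring the observation that an FP adder already contains
    a comparator.
    """
    a_is_larger = a >= b  # stands in for the comparator output (O/P)
    larger, smaller = (a, b) if a_is_larger else (b, a)
    if op_sel == MAX:
        return larger  # comparator path: pass the larger operand through
    # Adder path: in hardware the smaller operand's mantissa would be
    # aligned to the larger operand before the addition.
    return larger + smaller


print(configurable_node(1.5, 2.0, ADD))  # 3.5
print(configurable_node(1.5, 2.0, MAX))  # 2.0
```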

Referring now to FIG. 6, there is shown circuitry 70 for enabling a CNN engine to process either a convolutional, max pooling or average pooling layer of a neural network according to an embodiment of the present invention. The circuitry 70 comprises a first layer of conventional floating point multipliers 30 such as are shown in FIG. 4. Now instead of a tree of floating point adder nodes 32 as shown in FIG. 4, there is provided a tree 32′ of configurable nodes which can be set to either produce a sum of their inputs or a maximum of their inputs according to the value of an Op_sel control input.

In the embodiment, the output of the final node in the tree 32′ and an output of one of the multipliers 30 are fed to an output module 60 as In0 and In1 respectively.

The circuitry 70 is configured to operate in a number of different modes:

In FIG. 6, the tree 32′ is configured for 3×3 2D convolution by having the nodes perform addition i.e. Op_sel is set to indicate addition and fed to each node. Now trained weight values 18 for a kernel are combined with input image/map values 16 typically surrounding a given pixel corresponding to in11 in the multipliers 30 as required, before being fed through the tree 32′ to the output module 60. An activation function enable signal is fed to the output module 60 and this will cause the output module 60 to apply a required activation function to the result of an operation on In0 and In1 determined according to the value of Op_sel, in this case addition, as will be explained in more detail below, before providing the convolution output for the given pixel of the input image/map.

FIG. 7 shows that the same tree 32′ can be configured for average pooling by setting the weight values 18 in the kernel to 1/9, where 9 is the number of (active) cells in the kernel, and having the nodes in the tree 32′ configured to perform addition. In this case, the output module 60 simply needs to add In0 and In1 to provide its output according to the value of Op_sel. Thus, no activation function (or only a simple identity function) need be applied to the operational result of In0, In1.

Finally, FIG. 8 shows that the same tree 32′ can be configured for max pooling by setting the (active) weight values 18 in the kernel to 1 and having the nodes in the tree 32′ configured as comparators only to produce a maximum of their operands. In this case, the output module 60 simply needs to select the larger of In0 and In1 as its output according to the value of Op_sel. Again, no activation function (or only a simple identity function) need be applied to the operational result of In0, In1.

Again, the trees presented here are just an example for 3×3 kernels. Other kernel sizes can be easily implemented.
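The three modes of FIGS. 6 to 8 can be modelled with a short Python sketch. It assumes a 3×3 window in row-major order, that the first eight products feed the tree in pairs and that the ninth product feeds In1; that assignment, like the names `engine`, `combine`, `ADD` and `MAX`, is an assumption made for illustration.

```python
ADD, MAX = "add", "max"  # hypothetical values of the Op_sel control input


def combine(a, b, op_sel):
    """Configurable node: sum of the inputs or the larger of the inputs."""
    return a + b if op_sel == ADD else max(a, b)


def engine(values, weights, op_sel, activation=None):
    """Model nine multipliers, a three-level tree and an output module."""
    products = [v * w for v, w in zip(values, weights)]
    level, in1 = products[:8], products[8]  # eight to the tree, one to In1
    while len(level) > 1:  # reduce pairs until one node output remains
        level = [combine(level[i], level[i + 1], op_sel)
                 for i in range(0, len(level), 2)]
    in0 = level[0]
    result = combine(in0, in1, op_sel)  # output module operates on In0, In1
    return activation(result) if activation else result


window = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
kernel = [1.0, 0.0, -1.0, 2.0, 0.0, -2.0, 1.0, 0.0, -1.0]  # illustrative

conv = engine(window, kernel, ADD, activation=lambda x: max(0.0, x))
avg_pool = engine(window, [1.0 / 9.0] * 9, ADD)
max_pool = engine(window, [1.0] * 9, MAX)
print(max_pool)  # 9.0
```

The same function serves all three layer types; only the weights, the Op_sel value and the activation differ between calls.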

As indicated, convolution needs an activation function 22 to be applied to the dot product 20 of the input image/map values 16 and weight kernel values 18. Currently, the most used activation functions are the Identity function, i.e. no activation function is applied, ReLU and PReLU.

As shown in FIG. 9, ReLU (Rectified Linear Unit), a very simple function, is defined as follows:

-   -   ReLU(x)=x for x>0
    -   ReLU(x)=0 for x<=0

PReLU (Parametric ReLU), a more general form of ReLU function, is defined as follows:

-   -   pReLU(x)=x for x>0
    -   pReLU(x)=x/slope_coef for x<=0

Some embodiments of the present invention do not employ a precise slope and instead only use a slope with values that are a power of two. Although such slopes can be used as an approximation of a specified slope, this approach works best if the network is trained with the approximated PReLU function, rather than approximating the PReLU function after training.

In any case, such an approximation needs two inputs:

-   -   Coef: the logarithm of the slope coefficient
    -   PSlope: flag that indicates if the slope is negative or positive.

Thus, referring to FIG. 10, the PReLU function becomes:

-   -   pReLU(x)=x for x>0
    -   pReLU(x)=x/(2^Coef) for x<=0 and PSlope=0
    -   pReLU(x)=−x/(2^Coef) for x<=0 and PSlope=1
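The approximated PReLU function above can be written directly as a small Python sketch. The function name `prelu_pow2` is hypothetical; `coef` and `pslope` correspond to the Coef and PSlope inputs.

```python
def prelu_pow2(x, coef, pslope):
    """Approximated PReLU with a power-of-two slope coefficient.

    coef   -- logarithm (base 2) of the slope coefficient
    pslope -- 0 keeps the sign of the scaled value, 1 inverts it
    """
    if x > 0:
        return x
    scaled = x / (2 ** coef)  # division by a power of two
    return -scaled if pslope else scaled


print(prelu_pow2(3.0, 2, 0))   # 3.0  (positive input passes through)
print(prelu_pow2(-8.0, 2, 0))  # -2.0
print(prelu_pow2(-8.0, 2, 1))  # 2.0
```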

Looking at the approximated PReLU activation function, it will be seen that it only needs to modify the exponent of the input.

Referring now to FIG. 11 and based on the observations and approximations discussed above, the output module 60 can implement the addition, maximum, ReLU and PReLU functions as required. The Op_sel input is used to select between the addition or maximum of the two operands In0 and In1.

The activation input can have two bits and can select between three types of activation as shown in Table 1. It will therefore be appreciated that one or more activation functions can be added, especially by increasing the number of control bits.

TABLE 1

Activation input    Activation
0 0                 No Activation
0 1                 ReLU
1 0                 PReLU
1 1                 Not Supported

For PReLU, the coefficients can be coded with 4 bits, representing the logarithm of the coefficient:

4 bit PReLU Coef    PReLU slope coefficient
0000                1
0001                2
0010                4
. . .               . . .
1111                2^15

The output module 60 can take advantage of the floating point number pipeline with division by the power of two coefficient being done by subtracting the 4 bit PReLU coefficient instruction field value from the FP16 exponent value of the sum of In0 and In1 (or possibly the larger of the two, as will be explained below).

Using this scheme, an identity activation function can be implemented as a form of PReLU with PSlope=0 and Coef=0000, thus enabling the activation input to be defined with a single binary activation input with settings ReLU=0 or PReLU=1, so that the PReLU function shown in FIG. 10 could be modified as follows:

-   -   pReLU(x)=x for x>0
    -   pReLU(x)=(x/(2^Coef) for x<=0 and PSlope=0)*Activation Input
    -   pReLU(x)=(−x/(2^Coef) for x<=0 and PSlope=1)*Activation Input,

so enabling the function to be implemented in a single pipeline.

The special cases for subnormal/infinite values can be treated by the logic implementing the PReLU activation function in the following way:

-   -   If the input is a negative, NaN or infinite, the input value is passed to the output unchanged;
    -   If the result is a subnormal, the result is clamped to 0.
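The division by a power of two through exponent subtraction, together with these special cases, can be illustrated on FP16 bit patterns using Python's standard `struct` module. This is a sketch under stated assumptions: the helper names are hypothetical, inputs are assumed to be representable in FP16, the first rule is read here as covering NaN and infinite inputs, and a zero or subnormal input is treated like a subnormal result.

```python
import struct


def fp16_bits(x):
    """Return the 16-bit IEEE 754 half precision pattern of x."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def bits_fp16(bits):
    """Return the float encoded by a 16-bit half precision pattern."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def divide_by_pow2(x, coef):
    """Divide x by 2**coef by subtracting coef from the FP16 exponent."""
    bits = fp16_bits(x)
    sign = bits & 0x8000
    exponent = (bits >> 10) & 0x1F
    mantissa = bits & 0x03FF
    if exponent == 0x1F:  # NaN or infinity: pass through unchanged
        return x
    new_exponent = exponent - coef
    if new_exponent <= 0:  # result would be zero or subnormal: clamp to 0
        return 0.0
    return bits_fp16(sign | (new_exponent << 10) | mantissa)


print(divide_by_pow2(-8.0, 2))  # -2.0
print(divide_by_pow2(-1.5, 1))  # -0.75
```

Only the 5-bit exponent field changes; the sign and mantissa bits are copied through, which is why the operation fits into the existing floating point pipeline.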

It will be appreciated that normally, an activation function other than the identity function would only be used when processing convolution layers. However, using an output module 60 with configuration controls such as shown in FIG. 11, it would also be possible to apply a non-linear activation function to a pooling result. Although not common, this option could be useful in some situations.

Referring now to FIG. 12, a further form of layer which can be incorporated within a network can involve detecting peaks within an image/map. A PEAK operation scans a source image/map with an M×N window, 4×4 in FIG. 12, in raster order with a given step size, for example, 1, and produces a map containing a list of the peaks found, a peak being an image/map element that is strictly greater in value than any of the surrounding elements in a window.

Referring to FIG. 13, implementing a PEAK operation again requires the nodes of the tree 32′ to be configured as comparators, as in the case of max pooling. The only condition is to have the middle element of the M×N window, corresponding to the pixel location of the image/map being processed, connected to the multiplier whose output is fed directly to In1 of the output module 60′ i.e. in the example shown in FIG. 13, in11 and in22 are swapped. The output module 60′ now produces two outputs, one indicating the peak value (as for max pooling) and a second output indicating if In1 is greater than In0 i.e. that In1 is a valid peak.

One way to provide this second output is for the second output to comprise a binary value indicating the maximum operand e.g. 0 for In0 or 1 for In1. Such an output could also find application for functions other than PEAK.
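The two outputs of the PEAK configuration can be sketched in Python as follows. The function name `peak_outputs` and the flat window layout are assumptions for illustration; Python's built-in `max` stands in for the comparator tree, and all weights are taken to be 1 as for max pooling.

```python
def peak_outputs(window, centre_index):
    """Return (peak value, flag) for one window position.

    The centre element feeds In1 directly; all other elements pass
    through the comparator tree, whose result is In0.
    """
    others = [v for i, v in enumerate(window) if i != centre_index]
    in0 = max(others)            # output of the tree of comparators
    in1 = window[centre_index]   # element under test
    flag = 1 if in1 > in0 else 0  # 1 only if In1 is strictly greater
    return max(in0, in1), flag


# 3x3 windows in row-major order; index 4 is the middle element.
print(peak_outputs([1, 2, 3, 4, 9, 5, 6, 7, 8], 4))  # (9, 1)
print(peak_outputs([1, 2, 3, 4, 5, 9, 6, 7, 8], 4))  # (9, 0)
```

A tie between the centre element and a neighbour yields a flag of 0, consistent with a peak being strictly greater than its surrounding elements.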

In a still further variant, it is also possible for the tree 32′ to simultaneously provide a result of two pooling operations. In the example of FIG. 13, the tree 32′ produces two 2×2 max pooling operations in parallel. In this case, the outputs are drawn from the two adders 32(1), 32(2) in the penultimate stage of the tree. These outputs could also be fed through an activation function, if required, but in any case, this can enable the step size for a pooling operation to be increased, so reducing processing time for an image/map. Again, this may require a re-organisation of the input image/map values so that the desired 2×2 sub-windows from the image values 16 are fed via the multipliers 30 to the appropriate nodes of the tree 32′.

It will also be seen that the 2×2 pooling approach shown in FIG. 13 can be adapted to provide average pooling by setting the weight values 18 to ¼ and configuring the nodes of the tree 32′ for summation.
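A minimal Python sketch of the two parallel 2×2 pooling results follows. It assumes each penultimate node covers the four products of one sub-window; the function name `dual_pool_2x2` and the `mode` strings are hypothetical.

```python
def dual_pool_2x2(window_a, window_b, mode):
    """Return two 2x2 pooling results taken from the penultimate stage.

    mode "max" uses weights of 1 and comparator nodes;
    mode "avg" uses weights of 1/4 and summing nodes.
    """
    weight = 1.0 if mode == "max" else 0.25
    node = max if mode == "max" else (lambda a, b: a + b)
    outputs = []
    for window in (window_a, window_b):
        p = [v * weight for v in window]  # four multipliers per sub-window
        # Two first-stage nodes feed one penultimate-stage node.
        outputs.append(node(node(p[0], p[1]), node(p[2], p[3])))
    return tuple(outputs)


print(dual_pool_2x2([1, 2, 3, 4], [5, 6, 7, 8], "max"))  # (4.0, 8.0)
print(dual_pool_2x2([1, 2, 3, 4], [5, 6, 7, 8], "avg"))  # (2.5, 6.5)
```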

It will be appreciated that where only particular sized windows for 2D or 3D convolution or pooling were to be processed, the connections from input image/map and kernel values 16, 18 to the multipliers 30 could be hardwired. However, where window sizes can vary, then especially for peak layer processing and sub-window pooling, multiplexer circuitry (not shown) would need to be interposed between the inputs 16, 18 for the engine and the multipliers 30.

The above described embodiments can be implemented with a number of variants of a basic floating point adder:

-   -   Normal FP adder used within the output module 60, 60′;
    -   FP adder with activation function used within the output module 60, 60′;
    -   Combined adder and max used within the tree 32′; and
    -   Combined adder and max with activation function used within the output module 60, 60′.

Using these variants of adder, a common engine such as the CNN engine disclosed in PCT Application No. PCT/EP2016/081776 (Ref: FN-481-PCT) can be reconfigured for convolution, pooling or peak operations with minimal extra logic by comparison to implementing any given layer independently.

1-8. (canceled)
9. A neural network engine configured to receive at least one set of values corresponding to a pixel of an input map and a corresponding set of kernel values for a neural network layer of a neural network, the neural network engine comprising: a plurality of multipliers, each multiplier having a first operand input configured to be connected to an input map value and a second operand input configured to be connected to a corresponding kernel value; and pairs of multipliers of the plurality of multipliers providing respective outputs to respective input nodes of a tree of nodes, each node of the tree being configured to provide an output corresponding to either: a larger of inputs of the node; or a sum of the inputs of the node; wherein an output of the tree providing a first input of an output logic, and an output of one of the plurality of multipliers providing a second input of the output logic; wherein, based at least in part on the configuration of the nodes of the tree, the output logic is configurable as a convolution layer of the neural network, an average pooling layer of the neural network, and a max pooling layer of the neural network.
10. A neural network engine according to claim 9, wherein when the neural network engine is configured as the average pooling layer of the neural network, each of the kernel values comprises a value corresponding to 1/(a window size of the set of values), the nodes of the tree are configured to sum their inputs and the output logic is configured to sum first and second inputs and to provide the sum as the output of the output logic.
11. A neural network engine according to claim 9, wherein when the neural network engine is configured as the max pooling layer of the neural network, each of the kernel values comprises a value equal to 1, the nodes of the tree are configured to output a larger of their inputs and the output logic is configured to output a larger of its first and second inputs as the output of the output logic.
12. A neural network engine according to claim 9, wherein when the neural network engine is configured as the convolution layer of the neural network, each of the kernel values comprises a trained value for the layer, the nodes of the tree are configured to sum their inputs and the output logic is configured to sum its first and second inputs, to apply an activation function to the sum and to provide an output of the activation function as an output of the output logic.
13. A neural network engine according to claim 12, wherein the activation function is one of: an identity function, a rectified linear unit function, or a parametric rectified linear unit function.
14. A neural network engine according to claim 12, wherein the activation function is defined by a binary slope parameter and a slope coefficient.
15. A neural network engine according to claim 9, wherein the neural network engine is further configured to provide a plurality of outputs from penultimate nodes of the tree when the neural network engine is configured as the average pooling layer or the max pooling layer.
16. A neural network engine according to claim 9, wherein the neural network engine is further configured to implement logic to perform at least one of: process the convolution layer, process the average pooling layer, or process the max pooling layer.
17. A neural network engine according to claim 9, wherein the output logic is further configured to receive a control input value and process, based at least in part on the received control input value, at least one of: the convolution layer, the average pooling layer, or the max pooling layer.
18. A device comprising: hardware programmed to: configure multipliers, each multiplier having a first operand input configured to be connected to an input map value and a second operand input configured to be connected to a corresponding kernel value; provide the first operand input and the second operand input to an input node of a tree of nodes, the input node of the tree being configured to provide an output corresponding to either: a larger of the first operand input and the second operand input; or a sum of the first operand input and the second operand input; provide an output of the tree to a first input of an output logic and provide an output of one of the multipliers to a second input of the output logic; and based at least in part on the configuration of the input node of the tree, configure the output logic to implement at least one of: a convolution layer of a neural network; an average pooling layer of the neural network; or a max pooling layer of the neural network.
19. A device according to claim 18, wherein the output logic is further configured to receive a control input value and implement, based at least in part on the received control input value, at least one of: the convolution layer, the average pooling layer, or the max pooling layer.
20. A device according to claim 18, wherein when the output logic is configured to implement the convolution layer of the neural network, the nodes of the tree are configured to sum their inputs and the output logic is configured to sum its first and second inputs, to apply an activation function to the sum and to provide an output of the activation function as an output of the output logic.
21. A device according to claim 18, wherein when the output logic is configured to implement the average pooling layer of the neural network, the nodes of the tree are configured to sum their inputs and the output logic is configured to sum its first and second inputs and to provide the sum as the output of the output logic.
22. A device according to claim 18, wherein when the output logic is configured to implement the max pooling layer of the neural network, the nodes of the tree are configured to output a larger of their inputs and the output logic is configured to output a larger of its first and second inputs as the output of the output logic.
23. A method comprising: configuring multipliers, each multiplier having a first operand input configured to be connected to an input map value and a second operand input configured to be connected to a corresponding kernel value; providing the first operand input and the second operand input to an input node of a tree of nodes; providing an output of the tree to a first input of an output logic; providing one of the multipliers to a second input of the output logic; and based at least in part on a configuration of the input node of the tree, implementing, by the output logic, at least one of: a convolution layer of a neural network; an average pooling layer of the neural network; or a max pooling layer of the neural network.
24. A method according to claim 23, further comprising receiving a control input and wherein implementing by the output logic is based at least in part on a value associated with the control input.
25. A method according to claim 23, wherein the input node of the tree is configured to provide an output corresponding to either: a larger of the first operand input and the second operand input; or a sum of the first operand input and the second operand input.
26. A method according to claim 23, wherein when implementing by the output logic comprises implementing the convolution layer of the neural network, the output logic comprises summing the first input and the second input of the output logic, applying an activation function to the sum and providing an output of the activation function as an output of the output logic.
27. A method according to claim 23, wherein when implementing by the output logic comprises implementing the average pooling layer of the neural network, the output logic comprises summing the first input and the second input and providing the sum as an output of the output logic.
28. A method according to claim 23, wherein when implementing by the output logic comprises implementing the max pooling layer of the neural network, the output logic comprises outputting a larger of the first input and the second input.